Step 0: Load necessary libraries

Load the libraries used in this analysis.

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: sp
## Checking rgeos availability: TRUE
## Linking to GEOS 3.5.1, GDAL 2.1.3, PROJ 4.9.2
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map

Step 1: Load datasets

We consider three datasets:

* Teen birth data by county
* Education by county
* Unemployment rate by county

## Parsed with column specification:
## cols(
##   Year = col_double(),
##   State = col_character(),
##   County = col_character(),
##   `State FIPS Code` = col_double(),
##   `County FIPS Code` = col_double(),
##   `Combined FIPS Code` = col_double(),
##   `Birth Rate` = col_double(),
##   `Lower Confidence Limit` = col_double(),
##   `Upper Confidence Limit` = col_double()
## )
## Warning: Missing column names filled in: 'X48' [48]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `FIPS Code` = col_character(),
##   State = col_character(),
##   `Area name` = col_character(),
##   `Less than a high school diploma, 1970` = col_number(),
##   `High school diploma only, 1970` = col_number(),
##   `Some college (1-3 years), 1970` = col_number(),
##   `Four years of college or higher, 1970` = col_number(),
##   `Less than a high school diploma, 1980` = col_number(),
##   `High school diploma only, 1980` = col_number(),
##   `Some college (1-3 years), 1980` = col_number(),
##   `Four years of college or higher, 1980` = col_number(),
##   `Less than a high school diploma, 1990` = col_number(),
##   `High school diploma only, 1990` = col_number(),
##   `Some college or associate's degree, 1990` = col_number(),
##   `Bachelor's degree or higher, 1990` = col_number(),
##   `Less than a high school diploma, 2000` = col_number(),
##   `High school diploma only, 2000` = col_number(),
##   `Some college or associate's degree, 2000` = col_number(),
##   `Bachelor's degree or higher, 2000` = col_number(),
##   `Less than a high school diploma, 2013-17` = col_number()
##   # ... with 4 more columns
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
##   State = col_character(),
##   `State Code` = col_character(),
##   Region = col_character(),
##   Division = col_character()
## )

Step 2: Clean datasets for ease of use

Looking at the years that the three datasets have in common, we use 2015 data for the teen_birth and Unemployment datasets. For Education, we use the 2013-2017 average.
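A minimal sketch of this filtering step, using small made-up data frames in place of the real files. The object names `teen_birth`, `education`, `teen_birth_2015`, and `education_recent` and all values are hypothetical; only the column names `Year`, `Combined FIPS Code`, `Birth Rate`, `FIPS Code`, and the `2013-17` suffix are taken from the parsed column specifications in Step 1. The actual cleaning code is not shown in this report and may differ.

```r
library(dplyr)

# Toy stand-in for the teen birth data (values are made up).
teen_birth <- tibble(
  Year = c(2014, 2015, 2015),
  `Combined FIPS Code` = c(1001, 1001, 1003),
  `Birth Rate` = c(30.1, 28.4, 25.0)
)

# Toy stand-in for the education data (values are made up); the real file
# has one column per attainment level and period.
education <- tibble(
  `FIPS Code` = c("01001", "01003"),
  `Less than a high school diploma, 2000` = c(100, 200),
  `Less than a high school diploma, 2013-17` = c(80, 150)
)

# Keep only the 2015 rows of the yearly data.
teen_birth_2015 <- teen_birth %>%
  filter(Year == 2015)

# Keep the county key plus the columns for the 2013-17 period.
education_recent <- education %>%
  select(`FIPS Code`, ends_with("2013-17"))
```

Selecting by suffix avoids listing every 2013-17 column by hand, at the cost of relying on the column naming staying consistent.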

Step 3: Join datasets into a master dataset

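We join the cleaned datasets into a single master dataset, which is used for the plots and models below.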
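One way such a join might look, as a sketch on made-up data. The column specifications in Step 1 show that the teen birth file stores `Combined FIPS Code` as a number while the education file stores `FIPS Code` as text, so the keys need to be brought to a common type before joining. The sketch assumes the text codes are zero-padded to five characters; that padding, the helper column `fips`, the object names, and all values are assumptions, not the report's actual code.

```r
library(dplyr)

# Made-up cleaned inputs.
teen_birth_2015 <- tibble(
  `Combined FIPS Code` = c(1001, 1003, 6037),
  `Birth Rate` = c(28.4, 25.0, 18.2)
)
education_recent <- tibble(
  `FIPS Code` = c("01001", "01003", "06037"),
  Less_than_high_school = c(12.3, 9.8, 21.0)
)

# Convert the numeric code to a zero-padded five-character string so that
# it matches the character key in the education data (assumed format).
teen_birth_keyed <- teen_birth_2015 %>%
  mutate(fips = sprintf("%05d", as.integer(`Combined FIPS Code`)))

# Join on the harmonized county key.
master <- teen_birth_keyed %>%
  inner_join(education_recent, by = c("fips" = "FIPS Code"))
```

`inner_join()` keeps only counties present in both tables; `left_join()` would instead keep every teen birth row and fill unmatched columns with `NA`. The report does not say which behavior was used.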

Step 4: Plotting birth rates against various education rates and unemployment rates

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).
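A sketch of how one of these scatter plots could be produced. The warnings above confirm that `geom_point` was used; everything else here (the data frame `master`, its values, the transparency setting, and the axis labels) is an assumption for illustration.

```r
library(ggplot2)

# Made-up master data; one birth rate is missing on purpose.
master <- data.frame(
  UnemploymentRate = c(3.5, 5.1, 6.8, 4.2, 7.9),
  `Birth Rate` = c(18.0, 24.5, 31.2, NA, 35.6),
  check.names = FALSE  # keep the space in the column name
)

# Scatter plot of birth rate against unemployment rate.
p <- ggplot(master, aes(x = UnemploymentRate, y = `Birth Rate`)) +
  geom_point(alpha = 0.6) +
  labs(x = "Unemployment rate", y = "Teen birth rate")
```

Printing `p` draws the plot. Rows with a missing x or y value are dropped at that point with a warning like the ones shown above.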

Step 5: Linear model of birth rate as a function of unemployment rate and education

The education data divides the population into four groups: less than high school, high school, college, and beyond college. Since the four shares add up to 100%, including all four together with an intercept would make the predictors perfectly collinear, so the linear model includes only three of the four rates. The omitted group serves as the baseline.
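The effect of including all four shares can be illustrated on simulated data. In this sketch the variable names, the simulated shares, and the coefficients used to generate `birth_rate` are all made up; only the idea (shares summing to 100, three of four used as predictors) comes from the text.

```r
set.seed(1)
n <- 200

# Simulate four attainment shares that sum to 100 for every county.
raw <- matrix(runif(n * 4, min = 1, max = 10), ncol = 4)
shares <- 100 * raw / rowSums(raw)
toy <- data.frame(
  less_than_hs = shares[, 1],
  high_school  = shares[, 2],
  college      = shares[, 3],
  beyond       = shares[, 4]
)
toy$birth_rate <- 20 + 0.8 * toy$less_than_hs - 0.3 * toy$college +
  rnorm(n, sd = 2)

# With all four shares plus an intercept the design matrix is rank
# deficient, so lm() cannot estimate one coefficient and reports it as NA.
fit_all <- lm(birth_rate ~ less_than_hs + high_school + college + beyond,
              data = toy)

# Dropping one share restores full rank; the omitted group is the baseline.
fit_three <- lm(birth_rate ~ less_than_hs + high_school + college,
                data = toy)

coef(fit_all)
coef(fit_three)
```

Both fits give the same fitted values, because the fourth share adds no information beyond the other three and the intercept. In the three-share model, each education coefficient is read as the change in birth rate when one percentage point moves from the omitted group into that group.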

## 
## Call:
## lm(formula = `Birth Rate` ~ UnemploymentRate + Less_than_high_school + 
##     high_school + college, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.624  -5.317  -0.643   4.622  49.176 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           28.31067    2.23906  12.644  < 2e-16 ***
## UnemploymentRate       0.68262    0.08822   7.738 1.35e-14 ***
## Less_than_high_school  1.01869    0.03475  29.316  < 2e-16 ***
## high_school           -0.23215    0.03749  -6.193 6.69e-10 ***
## college               -0.48924    0.03492 -14.011  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.514 on 3128 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5801, Adjusted R-squared:  0.5796 
## F-statistic:  1080 on 4 and 3128 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = `Birth Rate` ~ UnemploymentRate + Less_than_high_school + 
##     high_school + college + Region, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.623  -5.341  -0.588   4.455  53.584 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           27.59481    2.38614  11.565  < 2e-16 ***
## UnemploymentRate       0.61385    0.08798   6.977 3.67e-12 ***
## Less_than_high_school  0.80832    0.03904  20.706  < 2e-16 ***
## high_school           -0.18392    0.04005  -4.593 4.55e-06 ***
## college               -0.48397    0.03724 -12.995  < 2e-16 ***
## RegionNortheast       -3.49239    0.65989  -5.292 1.29e-07 ***
## RegionSouth            4.99573    0.40216  12.422  < 2e-16 ***
## RegionWest             1.45183    0.50883   2.853  0.00436 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.187 on 3125 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.612,  Adjusted R-squared:  0.6112 
## F-statistic: 704.3 on 7 and 3125 DF,  p-value: < 2.2e-16
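The second model adds `Region` as a categorical predictor. By default R codes a factor with treatment contrasts, so a four-level factor yields three coefficients and the first level (alphabetically) becomes the baseline; in the output above, the region without its own coefficient plays that role. The sketch below reproduces this behavior on simulated data and compares the nested models with `anova()`. The data, the region effects, and the use of `anova()` are illustrative assumptions; the report itself does not show such a comparison.

```r
set.seed(2)
n <- 400

# Simulated counties with a made-up region effect.
toy <- data.frame(
  UnemploymentRate = runif(n, min = 2, max = 12),
  Region = factor(sample(c("Midwest", "Northeast", "South", "West"),
                         n, replace = TRUE))
)
region_shift <- c(Midwest = 0, Northeast = -3, South = 5, West = 1)
toy$birth_rate <- 25 + 0.6 * toy$UnemploymentRate +
  unname(region_shift[as.character(toy$Region)]) + rnorm(n, sd = 3)

# Nested models: without and with the region factor.
fit_base   <- lm(birth_rate ~ UnemploymentRate, data = toy)
fit_region <- lm(birth_rate ~ UnemploymentRate + Region, data = toy)

# Three region dummies appear; the first factor level is the baseline.
names(coef(fit_region))

# F-test of whether adding Region improves the fit.
anova(fit_base, fit_region)
```

Each region coefficient is the estimated difference from the baseline region with the other predictors held fixed. `relevel()` can be used to choose a different baseline.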